The paper "Scaling Optimal LR Across Token Horizons" explores the relationship between the learning rate (LR) and the token horizon when training large language models (LLMs). The authors, Johan Bjorck and colleagues, note that scaling LLMs involves increasing model size, dataset size, and compute, but that extensive hyperparameter tuning at the largest scale is often economically unfeasible. The standard workaround is to infer or transfer hyperparameters from smaller experiments to larger ones. While previous research has addressed hyperparameter transfer across model sizes, the authors identify a gap in the literature on transfer across dataset sizes, i.e., token horizons.

To address this gap, they conduct a large-scale empirical study of how the optimal learning rate varies with the token horizon during LLM training. Their findings show that the optimal learning rate decreases significantly as the token horizon grows: longer training runs require smaller learning rates. They further establish that the optimal learning rate follows a scaling law, so the optimal value for a long horizon can be accurately estimated from experiments at shorter horizons.

Building on this, they propose a practical rule of thumb for transferring learning rates across token horizons that adds no overhead to current training practice. They also analyze the learning rate used for the LLaMA-1 model, arguing that it was set too high and estimating the performance lost to this miscalibration. The authors conclude that hyperparameter transfer across dataset sizes is a critical yet often overlooked aspect of LLM training, and they call for further work in this area to improve model performance and efficiency.
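The transfer procedure the summary describes can be sketched in a few lines: fit a power law to the optimal LRs found by sweeps at several short token horizons, then extrapolate to the target horizon. This is a minimal illustration, not the paper's implementation; the function names and the exponent used in the example are assumptions, and in practice the exponent must be fit to one's own sweep data.

```python
import math

def fit_power_law(horizons, opt_lrs):
    """Fit lr_opt(D) = c * D**(-beta) by least squares in log-log space.

    horizons: token horizons (e.g. in tokens) at which LR sweeps were run.
    opt_lrs:  the learning rate found optimal at each horizon.
    Returns (c, beta). A power law is linear in log-log coordinates,
    so an ordinary least-squares line fit recovers the exponent.
    """
    xs = [math.log(d) for d in horizons]
    ys = [math.log(lr) for lr in opt_lrs]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope  # beta is the negated slope

def transfer_lr(lr_small, tokens_small, tokens_large, beta):
    """Rule-of-thumb transfer: rescale a tuned LR by the horizon ratio."""
    return lr_small * (tokens_large / tokens_small) ** (-beta)

# Synthetic sweep results lying exactly on a power law (beta = 0.5 is a
# placeholder, not a value from the paper):
horizons = [1e9, 2e9, 4e9]
opt_lrs = [0.1 * d ** -0.5 for d in horizons]
c, beta = fit_power_law(horizons, opt_lrs)

# Extrapolate the 1B-token optimum to a 100B-token run:
lr_100b = transfer_lr(opt_lrs[0], 1e9, 100e9, beta)
```

The key design point is that the fit happens entirely at cheap, short horizons; the expensive long run then uses the extrapolated LR directly, which is why the paper's rule of thumb adds no tuning overhead at the target scale.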